7 research outputs found
Interpretable Transformations with Encoder-Decoder Networks
Deep feature spaces have the capacity to encode complex transformations of
their input data. However, understanding the relative feature-space
relationship between two transformed encoded images is difficult. For instance,
what is the relative feature space relationship between two rotated images?
What is decoded when we interpolate in feature space? Ideally, we want to
disentangle confounding factors, such as pose, appearance, and illumination,
from object identity. Disentangling these is difficult because they interact in
very nonlinear ways. We propose a simple method to construct a deep feature
space, with explicitly disentangled representations of several known
transformations. A person or algorithm can then manipulate the disentangled
representation, for example, to re-render an image with explicit control over
parameterized degrees of freedom. The feature space is constructed using a
transforming encoder-decoder network with a custom feature transform layer,
acting on the hidden representations. We demonstrate the advantages of explicit
disentangling on a variety of datasets and transformations, and as an aid for
traditional tasks, such as classification.
Comment: Accepted at ICCV 2017
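The core mechanism can be illustrated with a toy version of such a feature transform layer: an explicit rotation applied to a designated pair of latent dimensions while the remaining "identity" dimensions are left untouched. This is a minimal sketch under simplifying assumptions (a flat latent vector rather than the paper's hidden feature maps; the function name and dimension choice are hypothetical):

```python
import numpy as np

def feature_transform(z, theta, rot_dims=(0, 1)):
    """Toy feature transform layer (hypothetical sketch): rotate one
    designated 2-D sub-space of the latent code by angle theta, leaving
    all other (disentangled "identity") dimensions unchanged."""
    z = np.asarray(z, dtype=float).copy()
    i, j = rot_dims
    c, s = np.cos(theta), np.sin(theta)
    zi, zj = z[i], z[j]
    z[i] = c * zi - s * zj   # 2-D rotation on the transform sub-space
    z[j] = s * zi + c * zj
    return z

# A 90-degree rotation moves the pose sub-space; the last
# ("identity") dimension is untouched.
z = np.array([1.0, 0.0, 0.5])
z_rot = feature_transform(z, np.pi / 2)
```

Because the transformation acts linearly and explicitly on known latent dimensions, a person or algorithm can dial the rotation parameter directly, which is the sense in which the representation is manipulable.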
Self-Supervised Monocular Depth Hints
Monocular depth estimators can be trained with various forms of
self-supervision from binocular-stereo data to circumvent the need for
high-quality laser scans or other ground-truth data. The disadvantage, however,
is that the photometric reprojection losses used with self-supervised learning
typically have multiple local minima. These plausible-looking alternatives to
ground truth can restrict what a regression network learns, causing it to
predict depth maps of limited quality. As one prominent example, depth
discontinuities around thin structures are often incorrectly estimated by
current state-of-the-art methods.
Here, we study the problem of ambiguous reprojections in depth prediction
from stereo-based self-supervision, and introduce Depth Hints to alleviate
their effects. Depth Hints are complementary depth suggestions obtained from
simple off-the-shelf stereo algorithms. These hints enhance an existing
photometric loss function, and are used to guide a network to learn better
weights. They require no additional data, and are assumed to be right only
sometimes. We show that using our Depth Hints gives a substantial boost when
training several leading self-supervised-from-stereo models, not just our own.
Further, combined with other good practices, we produce state-of-the-art depth
predictions on the KITTI benchmark.
Comment: Accepted at ICCV 2019
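The guiding idea can be sketched per pixel: where the hint's photometric reprojection loss beats the network's, an extra supervised term pulls the prediction toward the hint; elsewhere the hint is ignored, which is how occasionally-wrong hints do no harm. This is a hypothetical numpy sketch of that selection rule, not the paper's implementation (the function name and log-depth penalty are assumptions):

```python
import numpy as np

def depth_hints_loss(loss_pred, loss_hint, depth_pred, depth_hint):
    """Per-pixel sketch of the Depth Hints selection rule (hypothetical
    API): add a supervised log-depth term only at pixels where the
    hint's photometric reprojection loss is lower than the network's."""
    hint_better = loss_hint < loss_pred                     # boolean mask
    supervised = np.abs(np.log(depth_pred) - np.log(depth_hint))
    return loss_pred + hint_better * supervised

# Two pixels: the hint wins only at pixel 0, so only pixel 0
# receives the extra supervised term.
loss = depth_hints_loss(
    loss_pred=np.array([0.2, 0.1]),
    loss_hint=np.array([0.1, 0.3]),
    depth_pred=np.array([5.0, 5.0]),
    depth_hint=np.array([10.0, 10.0]),
)
```

The mask is what makes the hints safe to use even though they are "assumed to be right only sometimes": a hint that reprojects worse than the current prediction contributes nothing.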
Modeling Object Appearance using Context-Conditioned Component Analysis
Subspace models have been very successful at modeling
the appearance of structured image datasets when the visual objects have been aligned in the images (e.g., faces).
Even with extensions that allow for global transformations or dense warps of the image, the set of visual objects whose appearance may be modeled by such methods is limited.
They are unable to account for visual objects where occlusion leads to changing visibility of different object parts (without a strict layered structure) and where a one-to-one mapping between parts is not preserved. For example, bunches of bananas contain different numbers of bananas, yet each individual banana shares an appearance subspace.
In this work we remove the image-space alignment limitations of existing subspace models by conditioning the models on a shape-dependent context that allows the complex, non-linear structure of the appearance of the visual object to be captured and shared. This allows us to exploit the advantages of subspace appearance models with non-rigid, deformable objects whilst also dealing with complex occlusions and varying numbers of parts. We demonstrate the effectiveness of our new model with examples of structured inpainting and appearance transfer.
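For contrast, the classical subspace appearance model this work extends can be sketched as plain PCA: fit a low-dimensional linear basis to aligned appearance vectors, then reconstruct any sample from its subspace coordinates. The paper's contribution, conditioning the components on a shape-dependent context, is deliberately omitted here; the function names are illustrative:

```python
import numpy as np

def fit_subspace(X, k):
    """Fit a k-dimensional linear appearance subspace (plain PCA via
    SVD of the centered data); the paper's model additionally
    conditions components on a shape-dependent context, omitted here."""
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]           # mean and top-k principal directions

def project(x, mean, basis):
    """Reconstruct x from its coordinates in the subspace."""
    coeffs = basis @ (x - mean)
    return mean + basis.T @ coeffs

# Toy data lying on a 1-D line in 3-D is reconstructed exactly
# with a single component.
t = np.linspace(0.0, 1.0, 20)
X = np.outer(t, [1.0, 2.0, 3.0])
mean, basis = fit_subspace(X, k=1)
x_hat = project(X[5], mean, basis)
```

The limitation the abstract describes follows directly from this formulation: the model assumes each row of `X` is an aligned vectorization of the same parts, which breaks down under occlusion or a varying number of parts.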
Two-View Geometry Scoring Without Correspondences
Camera pose estimation for two-view geometry traditionally relies on RANSAC.
Normally, a multitude of image correspondences leads to a pool of proposed
hypotheses, which are then scored to find a winning model. The inlier count is
generally regarded as a reliable indicator of "consensus". We examine this
scoring heuristic, and find that it favors disappointing models under certain
circumstances. As a remedy, we propose the Fundamental Scoring Network (FSNet),
which infers a score for a pair of overlapping images and any proposed
fundamental matrix. It does not rely on sparse correspondences, but rather
embodies a two-view geometry model through an epipolar attention mechanism that
predicts the pose error of the two images. FSNet can be incorporated into
traditional RANSAC loops. We evaluate FSNet on fundamental and essential matrix
estimation on indoor and outdoor datasets, and establish that FSNet can
successfully identify good poses for pairs of images with few or unreliable
correspondences. Moreover, we show that naively combining FSNet with the
MAGSAC++ scoring approach achieves state-of-the-art results.
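The way a learned scorer slots into RANSAC can be sketched generically: hypothesis selection only depends on a scoring function, so replacing inlier counting with a network-predicted score changes a single argument. This is a hypothetical sketch, not FSNet itself (whose real input is the image pair plus a proposed fundamental matrix):

```python
import numpy as np

def ransac_best_model(hypotheses, score_fn):
    """Generic RANSAC model-selection step: keep whichever hypothesis
    the scoring function prefers. Swapping a learned scorer (in the
    spirit of FSNet) for inlier counting only changes score_fn."""
    best, best_score = None, -np.inf
    for h in hypotheses:
        s = score_fn(h)
        if s > best_score:
            best, best_score = h, s
    return best, best_score

# Toy scorer: a stand-in for a learned network that predicts pose
# error for each proposed model (lower error => higher score).
hyps = [{"id": 0, "err": 3.0}, {"id": 1, "err": 0.5}, {"id": 2, "err": 1.2}]
best, best_score = ransac_best_model(hyps, score_fn=lambda h: -h["err"])
```

Because the loop is agnostic to how scores are produced, combining two scorers (e.g. a learned score with MAGSAC++-style scoring, as the abstract describes) amounts to composing them inside `score_fn`.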
Single Image Depth Prediction with Wavelet Decomposition
We present a novel method for predicting accurate depths from monocular images with high efficiency. This efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients. In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on coefficients. Instead we supervise only the final depth image that is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results compared to the original model, while requiring less than half the multiply-adds in the decoder network.
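The reconstruction step the abstract supervises through can be illustrated with one level of an inverse 2-D Haar transform: a full-resolution depth map is rebuilt from a coarse band plus three (mostly sparse) detail bands. This is a minimal sketch assuming an unnormalized Haar convention; the paper uses a learned decoder to predict the coefficients at multiple scales:

```python
import numpy as np

def inverse_haar_2d(ll, lh, hl, hh):
    """One level of an inverse 2-D Haar transform (unnormalized
    convention): rebuild a (2H, 2W) image from a coarse band ll and
    three detail bands lh, hl, hh, each of shape (H, W)."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w))
    # Each 2x2 output block is a fixed linear combination of the
    # four coefficients at the corresponding coarse location.
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

# With all detail coefficients zero (the sparse case), the result is
# a smooth upsampling of the coarse band alone.
coarse = np.full((2, 2), 2.0)
zeros = np.zeros((2, 2))
depth = inverse_haar_2d(coarse, zeros, zeros, zeros)
```

Sparsity is what buys the efficiency claim: wherever the detail bands are zero, the decoder needs no work beyond the (much smaller) coarse band, which is consistent with the reported reduction in decoder multiply-adds.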